04:00
2026-06-16
arxiv.org
large-language-models
Metric Match: A Subset Selection Approach to Evaluating LLM Judge Reliability
Researchers developed Metric Match, a subset selection method that estimates LLM judge reliability from limited human annotations. The method achieved a win-rate of 0.838 against random selection acro…